Business objective: predict total sales on the basis of historical data.
For validation, I try to predict sales for which I already have information.
I take the data from the Kaggle competition Predict Future Sales. It contains information about sales from 60 shops.
The data consists of 5 files:
| | date | date_block_num | shop_id | item_id | item_price | item_cnt_day |
|---|---|---|---|---|---|---|
| 0 | 02.01.2013 | 0 | 59 | 22154 | 999.000000 | 1.000000 |
| 1 | 03.01.2013 | 0 | 25 | 2552 | 899.000000 | 1.000000 |
| 2 | 05.01.2013 | 0 | 25 | 2552 | 899.000000 | -1.000000 |
| 3 | 06.01.2013 | 0 | 25 | 2554 | 1709.050000 | 1.000000 |
| 4 | 15.01.2013 | 0 | 25 | 2555 | 1099.000000 | 1.000000 |
| 5 | 10.01.2013 | 0 | 25 | 2564 | 349.000000 | 1.000000 |
| 6 | 02.01.2013 | 0 | 25 | 2565 | 549.000000 | 1.000000 |
| 7 | 04.01.2013 | 0 | 25 | 2572 | 239.000000 | 1.000000 |
| | date_block_num | shop_id | item_id | item_price | item_cnt_day |
|---|---|---|---|---|---|
| count | 2935849.000000 | 2935849.000000 | 2935849.000000 | 2935849.000000 | 2935849.000000 |
| mean | 14.569911 | 33.001728 | 10197.227057 | 890.853233 | 1.242641 |
| std | 9.422988 | 16.226973 | 6324.297354 | 1729.799631 | 2.618834 |
| min | 0.000000 | 0.000000 | 0.000000 | -1.000000 | -22.000000 |
| 25% | 7.000000 | 22.000000 | 4476.000000 | 249.000000 | 1.000000 |
| 50% | 14.000000 | 31.000000 | 9343.000000 | 399.000000 | 1.000000 |
| 75% | 23.000000 | 47.000000 | 15684.000000 | 999.000000 | 1.000000 |
| max | 33.000000 | 59.000000 | 22169.000000 | 307980.000000 | 2169.000000 |
| | 0 |
|---|---|
| date | 0 |
| date_block_num | 0 |
| shop_id | 0 |
| item_id | 0 |
| item_price | 0 |
| | shop_name | shop_id |
|---|---|---|
| 0 | !Якутск Орджоникидзе, 56 фран | 0 |
| 1 | !Якутск ТЦ "Центральный" фран | 1 |
| 2 | Адыгея ТЦ "Мега" | 2 |
| 3 | Балашиха ТРК "Октябрь-Киномир" | 3 |
| 4 | Волжский ТЦ "Волга Молл" | 4 |
| | item_name | item_id | item_category_id |
|---|---|---|---|
| 0 | ! ВО ВЛАСТИ НАВАЖДЕНИЯ (ПЛАСТ.) D | 0 | 40 |
| 1 | !ABBYY FineReader 12 Professional Edition Full [PC, Цифровая версия] | 1 | 76 |
| 2 | ***В ЛУЧАХ СЛАВЫ (UNV) D | 2 | 40 |
| 3 | ***ГОЛУБАЯ ВОЛНА (Univ) D | 3 | 40 |
| 4 | ***КОРОБКА (СТЕКЛО) D | 4 | 40 |
| | item_category_name | item_category_id |
|---|---|---|
| 0 | PC - Гарнитуры/Наушники | 0 |
| 1 | Аксессуары - PS2 | 1 |
| 2 | Аксессуары - PS3 | 2 |
| 3 | Аксессуары - PS4 | 3 |
| 4 | Аксессуары - PSP | 4 |
| | ID | shop_id | item_id |
|---|---|---|---|
| 0 | 0 | 5 | 5037 |
| 1 | 1 | 5 | 5320 |
| 2 | 2 | 5 | 5233 |
| 3 | 3 | 5 | 5232 |
| 4 | 4 | 5 | 5268 |
| | Frame | Number of rows | Number of columns |
|---|---|---|---|
| 0 | sales_train.csv | 2935849 | 6 |
| 1 | test.csv | 214200 | 3 |
| 2 | shops.csv | 60 | 2 |
| 3 | items.csv | 22170 | 3 |
| 4 | item_categories.csv | 84 | 2 |
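As a minimal sketch of how the overview table is produced (the `summarize_frames` helper is a hypothetical name; in the notebook the frames come from `pd.read_csv` on the files listed above):

```python
import pandas as pd

def summarize_frames(frames):
    """Build the Frame / rows / columns overview from a dict of DataFrames."""
    return pd.DataFrame(
        [(name, df.shape[0], df.shape[1]) for name, df in frames.items()],
        columns=["Frame", "Number of rows", "Number of columns"],
    )

# Tiny stand-in frames keep the sketch runnable; the real notebook would use
# {"sales_train.csv": pd.read_csv("sales_train.csv"), ...}
demo = {
    "shops.csv": pd.DataFrame({"shop_name": ['Адыгея ТЦ "Мега"'], "shop_id": [2]}),
    "item_categories.csv": pd.DataFrame({"item_category_name": ["Аксессуары - PS2"],
                                         "item_category_id": [1]}),
}
print(summarize_frames(demo))
```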
Let's start by choosing only one shop, because that is the more realistic situation: a single shop asking somebody to predict its sales, rather than a whole chain of stores.
I choose the shop with id 31, which is the Semenovskiy Shopping & Entertainment Center in Moscow.
Since I want to predict only on the basis of previous sales, I can drop the item_price column, and of course shop_id.
I change the date format: the `date` column is reduced to the month only (so basically a copy of the date_block_num column), and I add a `date_with_day` column that keeps the full date.
| | date | date_block_num | item_id | item_cnt_day | date_with_day |
|---|---|---|---|---|---|
| 60143 | 2013-01 | 0 | 4906 | 2.000000 | 2013-01-03 |
| 60144 | 2013-01 | 0 | 4906 | 1.000000 | 2013-01-02 |
| 60145 | 2013-01 | 0 | 4890 | 1.000000 | 2013-01-11 |
| 60146 | 2013-01 | 0 | 4901 | 1.000000 | 2013-01-26 |
| 60147 | 2013-01 | 0 | 4901 | 1.000000 | 2013-01-25 |
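The filtering and date handling described above can be sketched as follows (column names match the dataset; `prepare_shop` is a hypothetical helper name):

```python
import pandas as pd

def prepare_shop(sales, shop_id=31):
    """Keep a single shop, drop the price and shop columns, parse dates,
    and reduce `date` to a month period (mirroring date_block_num)."""
    df = sales[sales["shop_id"] == shop_id].drop(columns=["shop_id", "item_price"]).copy()
    df["date_with_day"] = pd.to_datetime(df["date"], format="%d.%m.%Y")
    df["date"] = df["date_with_day"].dt.to_period("M")
    return df
```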
Let's look at how sales behave in each month.
The marked areas are Decembers. Despite the fact that Russia celebrates Christmas in January, the sales peak is still near ours, in December.
Let's check which items people buy in this shop most often.
| | item_id | date_block_num | item_cnt_day | item_name | item_category_id | item_category_name |
|---|---|---|---|---|---|---|
| 14410 | 20949 | 15564 | 19934.000000 | Фирменный пакет майка 1С Интерес белый (34*42) 45 мкм | 71 | Подарки - Сумки, Альбомы, Коврики д/мыши |
| 13138 | 5822 | 10568 | 1293.000000 | Playstation Store пополнение бумажника: Карта оплаты 1000 руб. | 35 | Карты оплаты - PSN |
| 14411 | 17717 | 10586 | 1290.000000 | Прием денежных средств для 1С-Онлайн | 79 | Служебные |
| 4340 | 3732 | 4584 | 1142.000000 | Grand Theft Auto V [PS3, русские субтитры] | 19 | Игры - PS3 |
| 4704 | 2808 | 4639 | 1080.000000 | Diablo III [PC, Jewel, русская версия] | 30 | Игры PC - Стандартные издания |
We can observe that the first item is roughly 15 times more popular than the second one. It happens to be... a branded plastic bag.
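The ranking above can be reproduced with a groupby plus a merge against items.csv (a sketch; `top_items` is a hypothetical helper name):

```python
import pandas as pd

def top_items(sales, items, n=5):
    """Total units sold per item, joined with item names, highest first."""
    totals = (sales.groupby("item_id", as_index=False)["item_cnt_day"].sum()
                   .sort_values("item_cnt_day", ascending=False))
    return totals.merge(items[["item_id", "item_name"]], on="item_id").head(n)
```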
This is how sales of the 100 most popular items look.
Let's quickly check whether there are any outliers in the item_cnt_day column.
There were 2 sales deviating far from the norm, so I simply dropped them.
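The cleanup can be sketched as below; the threshold of 400 is an assumed illustration, since in the notebook the two outliers were spotted visually on the boxplot:

```python
import pandas as pd

def drop_outliers(df, column="item_cnt_day", threshold=400):
    """Drop rows whose daily count exceeds the threshold.

    The threshold is an assumption for this sketch; pick it after inspecting
    the distribution (e.g. a boxplot of the column).
    """
    return df[df[column] <= threshold]
```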
So the data clearly contains some patterns; we only need to find them! Let's check whether the month-to-month difference in total sales has a pattern.
This plot shows the difference in total sales between months. Some patterns certainly exist there. Let's use double exponential smoothing, a mathematical tool that maintains a level component and a trend component at each period. It uses two weights (also called smoothing parameters) to update the components at each period. The double exponential smoothing equations are as follows:
$L_t = \alpha V_t + (1 - \alpha)(L_{t-1} + T_{t-1})$
$T_t = \beta (L_{t} - L_{t-1}) + (1 - \beta)T_{t-1}$
$\hat{V}_t = L_{t-1} + T_{t-1}$
where:
$L_t$ is the level at time $t$, $T_t$ is the trend at time $t$, $V_t$ is the real value at time $t$, $\hat{V}_t$ is the one-step-ahead prediction for time $t$, and $\alpha$ and $\beta$ are the smoothing parameters.
The graph above shows how the smoothing behaves with different parameters. The best fit is $\alpha = 0.9$ and $\beta = 0.02$, and according to this prediction we can expect slightly above 7k sales in November 2015.
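The equations above translate directly into code. This is a pure-Python sketch; the initialization (level = first value, trend = first difference) is a common convention, assumed here:

```python
def double_exponential_smoothing(series, alpha, beta):
    """Holt's double exponential smoothing.

    Returns the one-step-ahead prediction for each point plus one forecast
    beyond the end of the series.
    """
    level, trend = series[0], series[1] - series[0]
    predictions = [series[0]]                      # no prediction exists for t = 0
    for t in range(1, len(series)):
        predictions.append(level + trend)          # prediction = L_{t-1} + T_{t-1}
        previous_level = level
        level = alpha * series[t] + (1 - alpha) * (level + trend)
        trend = beta * (level - previous_level) + (1 - beta) * trend
    predictions.append(level + trend)              # forecast one step past the data
    return predictions
```

Calling it with `alpha=0.9, beta=0.02` reproduces the best parameters found above.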
The main tool for prediction that I chose is Prophet (fbprophet), Facebook's forecasting library.
Besides predicting a value over the selected horizon, the model also predicts a confidence interval (in my case 95%).
from fbprophet import Prophet
prophet_model = Prophet(interval_width=0.95, daily_seasonality=True)
prophet_model.fit(prophet_df)
future_dates = prophet_model.make_future_dataframe(periods=30)
forecast = prophet_model.predict(future_dates)
10938.391902436291
| | ds | y |
|---|---|---|
| 0 | 2013-01-02 | 568.000000 |
| 1 | 2013-01-03 | 423.000000 |
| 2 | 2013-01-04 | 431.000000 |
| 3 | 2013-01-05 | 415.000000 |
| 4 | 2013-01-06 | 435.000000 |
| | ds | trend | yhat_lower | yhat_upper | trend_lower | trend_upper | daily | weekly | yearly | yearly_lower | yearly_upper | multiplicative_terms | yhat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1056 | 2015-11-26 00:00:00 | 366.905184 | 78.387560 | 348.719990 | 366.746087 | 367.046675 | -154.222931 | -25.436868 | 24.795219 | 24.795219 | 24.795219 | 0.000000 | 212.040604 |
| 1057 | 2015-11-27 00:00:00 | 366.717532 | 147.096618 | 422.482069 | 366.549737 | 366.869749 | -154.222931 | 56.250589 | 18.109921 | 18.109921 | 18.109921 | 0.000000 | 286.855111 |
| 1058 | 2015-11-28 00:00:00 | 366.529880 | 187.814360 | 441.788749 | 366.348629 | 366.692796 | -154.222931 | 82.617337 | 12.063773 | 12.063773 | 12.063773 | 0.000000 | 306.988059 |
| 1059 | 2015-11-29 00:00:00 | 366.342228 | 74.392635 | 352.098128 | 366.144637 | 366.515844 | -154.222931 | -1.845000 | 6.915752 | 6.915752 | 6.915752 | 0.000000 | 217.190049 |
| 1060 | 2015-11-30 00:00:00 | 366.154576 | 12.880823 | 290.986657 | 365.941878 | 366.336427 | -154.222931 | -57.696038 | 2.913410 | 2.913410 | 2.913410 | 0.000000 | 157.149017 |
So let's check the total predicted sales in October, see how wrong the model was, and then predict sales in November, which is our final goal.
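Summing Prophet's daily forecast into monthly totals can be sketched like this (it assumes the standard `ds`, `yhat`, `yhat_lower`, `yhat_upper` columns that Prophet emits):

```python
import pandas as pd

def monthly_totals(forecast):
    """Sum the predicted daily sales (and the 95% bounds) per calendar month."""
    monthly = forecast.assign(month=forecast["ds"].dt.to_period("M"))
    return monthly.groupby("month")[["yhat", "yhat_lower", "yhat_upper"]].sum()
```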
Conclusion:
We can expect sales around the level of 7570.
We are 95% sure that sales are somewhere between 3450 and 11657.
If we try to predict the next month, December, our model gives a pretty good prediction.
For December our model predicts 10938 total sales.
Since we already have such detailed data, we could try to do the thing this data set was created for: predict total November sales for specific items. For this task I chose an LSTM (Long Short-Term Memory) network, whose concept I understand, though not much beyond that.
Let's start by making a pivot table where rows are items, columns are the 34 months (0 is January 2013 and 33 is October 2015), and each cell value is the total sales of item I in month M.
| | item_id | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | 32 | 33 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 26 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 27 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 28 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 29 | 2 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 30 | 0 | 112 | 65 | 13 | 10 | 3 | 4 | 1 | 4 | 2 | 1 | 9 | 7 | 2 | 3 | 1 | 2 | 1 | 2 | 1 | 0 | 3 | 1 | 4 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
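The pivot table above can be built with `pandas.pivot_table` (a sketch using the dataset's column names):

```python
import pandas as pd

def monthly_pivot(sales):
    """Rows: items; columns: month indices (date_block_num); values: units sold."""
    return pd.pivot_table(sales, index="item_id", columns="date_block_num",
                          values="item_cnt_day", aggfunc="sum", fill_value=0)
```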
Let's remove the item_id column, because we want to predict only on the basis of historical total sales.
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | 32 | 33 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 2 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 112 | 65 | 13 | 10 | 3 | 4 | 1 | 4 | 2 | 1 | 9 | 7 | 2 | 3 | 1 | 2 | 1 | 2 | 1 | 0 | 3 | 1 | 4 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
To train our model we will use the data from the first 33 months, and the last one, October 2015, will be the target. I don't split the data into train and validation sets manually, because the built-in option (validation set = 30% of all data) does it for me.
My LSTM model will have 4 layers:
```
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
lstm (LSTM)                  (None, 33, 64)            16896
_________________________________________________________________
lstm_1 (LSTM)                (None, 32)                12416
_________________________________________________________________
dropout (Dropout)            (None, 32)                0
_________________________________________________________________
dense (Dense)                (None, 1)                 33
=================================================================
Total params: 29,345
Trainable params: 29,345
Non-trainable params: 0
_________________________________________________________________
```
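A Keras sketch matching the summary above; the layer sizes come from the summary, while the dropout rate, optimizer, and loss are assumptions:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, LSTM, Dropout, Dense

def build_model(timesteps=33):
    """Sequential LSTM matching the printed summary (29,345 parameters)."""
    model = Sequential([
        Input(shape=(timesteps, 1)),       # 33 monthly totals, one feature
        LSTM(64, return_sequences=True),   # lstm: (None, 33, 64), 16896 params
        LSTM(32),                          # lstm_1: (None, 32), 12416 params
        Dropout(0.2),                      # rate assumed; the summary doesn't record it
        Dense(1),                          # predicted sales for month 34
    ])
    model.compile(optimizer="adam", loss="mean_squared_error",
                  metrics=["mean_squared_error"])
    return model
```

Training would reshape the pivot table's first 33 columns to `(n_items, 33, 1)`, take the last column as the target, and call `fit(..., validation_split=0.3)`, matching the 30% built-in validation option mentioned above.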
```
Epoch 1/8
79/79 [==============================] - 24s 42ms/step - loss: 29.3462 - mean_squared_error: 29.3462 - val_loss: 1492.2611 - val_mean_squared_error: 1492.2611
Epoch 2/8
79/79 [==============================] - 2s 26ms/step - loss: 20.7860 - mean_squared_error: 20.7860 - val_loss: 5.9117 - val_mean_squared_error: 5.9117
Epoch 3/8
79/79 [==============================] - 2s 26ms/step - loss: 3.1615 - mean_squared_error: 3.1615 - val_loss: 113.4887 - val_mean_squared_error: 113.4887
Epoch 4/8
79/79 [==============================] - 2s 26ms/step - loss: 4.2574 - mean_squared_error: 4.2574 - val_loss: 61.0672 - val_mean_squared_error: 61.0672
Epoch 5/8
79/79 [==============================] - 2s 26ms/step - loss: 2.9142 - mean_squared_error: 2.9142 - val_loss: 2.0277 - val_mean_squared_error: 2.0277
Epoch 6/8
79/79 [==============================] - 2s 27ms/step - loss: 5.0791 - mean_squared_error: 5.0791 - val_loss: 69.4995 - val_mean_squared_error: 69.4995
Epoch 7/8
79/79 [==============================] - 2s 27ms/step - loss: 5.9779 - mean_squared_error: 5.9779 - val_loss: 60.4224 - val_mean_squared_error: 60.4224
Epoch 8/8
79/79 [==============================] - 2s 26ms/step - loss: 3.6242 - mean_squared_error: 3.6242 - val_loss: 184.8788 - val_mean_squared_error: 184.8788
```
Let's check what the predictions look like for a few chosen items.
| | item_name | predicted_amount |
|---|---|---|
| 0 | ***КОРОБКА (СТЕКЛО) D | [1.] |
| 1 | 1С:Коллекция игрушек "Битва за Британию 2: Крылья Победы" DVD | [1.] |
| 2 | 3D Action Puzzle "Зомби" Уборщик | [0.] |
| 3 | Diablo III [PC, Jewel, русская версия] | [0.] |
| 4 | Dr.Web Security Space КЗ 1 ПК/1 год (картонная упаковка) | [0.] |
| 5 | World of Warcraft Gold [PC, Jewel, 14 дней, русская версия] | [1.] |
| 6 | X360: Геймпад беспроводной черный - Wireless Controller BLACK (NSF-00002: Microsoft) | [2.] |
| 7 | Прием денежных средств для 1С-Онлайн | [5.] |
| 8 | ХОББИТ: НЕЖДАННОЕ ПУТЕШЕСТВИЕ (регион) | [2.] |